8 research outputs found

    Integrating node embeddings and biological annotations for genes to predict disease-gene associations

    Background: Predicting disease-causative genes (or simply, disease genes) plays a critical role in understanding the genetic basis of human diseases and in providing disease treatment guidelines. Various computational methods have been proposed for disease gene prediction, and the increasing availability of biological information for genes strongly motivates leveraging these data sources to predict disease genes more accurately. Results: We present an integrative framework called N2VKO to predict disease genes. First, we learn node embeddings for genes from a protein-protein interaction (PPI) network by adapting the well-known representation learning method node2vec. Second, we combine the learned node embeddings with various biological annotations into a rich feature representation for genes, and then build binary classification models for disease gene prediction. Finally, because the data for disease gene prediction is usually imbalanced (i.e., the causative genes for a specific disease are far fewer than its non-causative genes), we address this imbalance by applying oversampling techniques to improve prediction performance. Comprehensive experiments demonstrate that N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases. Conclusions: In this study, we show that node embeddings learned from PPI networks work well for disease gene prediction, and that integrating node embeddings with other biological annotations further improves the performance of the classification models. Moreover, oversampling techniques for imbalance correction further enhance the prediction performance. In addition, a literature search of the predicted disease genes also shows the effectiveness of the N2VKO framework for disease gene prediction.
    MOE (Min. of Education, S’pore). Published version
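
The feature-combination and imbalance-correction steps described in this abstract can be sketched as follows. This is a minimal illustration, not the N2VKO implementation: the abstract does not name the oversampling technique, so plain random oversampling is used here as an assumed stand-in, and the gene names, embedding values, annotation vectors, and function names are all hypothetical.

```python
import random


def combine_features(embedding, annotations):
    """Concatenate a node embedding with an annotation vector for one gene."""
    return list(embedding) + list(annotations)


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class samples until both classes are equal in size.

    Assumption: random oversampling stands in for the unspecified
    oversampling techniques mentioned in the abstract.
    """
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must have at least one sample")
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    n_extra = max(len(pos), len(neg)) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(X) + extra, list(y) + [minority_label] * n_extra


# Hypothetical toy data: 2-d embeddings and 2-d binary annotation vectors.
embeddings = {"G1": [0.1, 0.9], "G2": [0.8, 0.2], "G3": [0.7, 0.3], "G4": [0.6, 0.1]}
annotations = {"G1": [1, 0], "G2": [0, 1], "G3": [0, 0], "G4": [1, 1]}
labels = {"G1": 1, "G2": 0, "G3": 0, "G4": 0}  # 1 = disease gene

genes = sorted(embeddings)
X = [combine_features(embeddings[g], annotations[g]) for g in genes]
y = [labels[g] for g in genes]
X_bal, y_bal = random_oversample(X, y)
```

The balanced `X_bal` and `y_bal` would then be passed to any binary classifier; oversampling should be applied to the training split only, so that duplicated samples do not leak into evaluation.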

    Learning graph representations for disease gene prediction

    The analysis of disease-causing conditions based on genes and their protein products plays a crucial role in the diagnosis and treatment of serious diseases such as cancer and diabetes. Since experimental techniques are time-consuming and expensive, computational methods remain important for revealing the functional roles of genes and proteins in many diseases. Molecular networks such as protein-protein interaction and co-expression networks are useful computational tools for uncovering the underlying mechanisms of diseases by studying the complex interplay between genes. However, molecular networks are often noisy and incomplete. Computational approaches are therefore designed to complement and enhance network data, and the vast amount of data accumulated across scientific domains is exploited to extract useful information for disease gene prediction efficiently. The goal of this research is to develop techniques that extract robust network-based feature representations for predicting disease-causing genes by combining the topological characteristics of molecular networks with biological knowledge such as gene ontology and protein domains. However, exploiting the topological arrangement of proteins in terms of not only their interactions but also their relatedness through other biological properties is a challenge. Furthermore, hand-engineering complex network features requires tedious effort and domain expertise. Automating the extraction of these topological features and combining them with the biological aspects of proteins is a challenge to be addressed. For this purpose, node embedding models are applied to automate the extraction of useful feature representations for disease gene prediction. In addition, various molecular networks provide deeper insight into genes and diseases from multiple perspectives.
Thus, a unified computational framework can leverage these perspectives to generate more robust representations. Designing models that automate the extraction of feature representations across multiple networks is essential to achieving higher prediction performance. In this thesis, we propose three computational frameworks to predict candidate disease genes:
    1. The first proposed method, called Metagraph, predicts candidate disease genes. To complement and enrich protein-protein interaction (PPI) networks, Metagraph leverages the biological properties of individual proteins by integrating their ontological properties, referred to as keywords, into the PPI network. This yields a novel PPI-Keywords (PPIK) network composed of proteins and keywords as two different types of nodes. As disease proteins tend to exhibit similar topological properties on the PPIK network, we further propose to represent proteins with metagraphs. Unlike a traditional network motif or subgraph, a metagraph captures both the topological arrangement of interactions between proteins and the associations between proteins and keywords. Thus, proteins that are not neighbors in a noisy PPI network have a better chance of being topologically similar through keywords. Extended metagraph representations that take disease occurrences into account, called Metagraph+, are fed into various classifiers for supervised disease protein prediction. Experiments show that Metagraph+ consistently improves disease protein prediction on three different PPI databases and outperforms state-of-the-art baselines, including both diffusion-based and module-based methods. In addition, the predictions of Metagraph+ correlate better with literature findings from the PubMed database.
    2. Numerous studies on the discovery of disease-associated genes and their part in the development of a disease have been proposed over the last decades. Yet features extracted automatically from a molecular network have not been exploited for disease gene prediction. The second proposed technique in this thesis is an integrative framework called N2VKO, which adopts a well-known representation learning method, node2vec. We combine the network-based feature representations of genes obtained by node2vec with the biological aspects of proteins. Then, we apply various feature selection methods and analyze their performance on the disease gene prediction task. As the data for disease gene prediction is imbalanced, we address this imbalance by applying oversampling techniques to our novel representations to improve prediction performance. Extensive experiments show that N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases. Moreover, the categories of the biological aspects within the N2VKO representations are listed to analyze their role in disease formation. We also provide literature evidence for the N2VKO biological features for lung cancer. Finally, literature evidence from the PubMed database confirms the effectiveness of the N2VKO framework for disease gene prediction.
    3. The third proposed framework also addresses the candidate disease gene prediction problem. Since various biological networks provide different insights into genes, combining them so that they complement each other in a unified framework is necessary to improve disease gene prediction performance. We propose a novel unsupervised algorithm for "Multi-view network embedding with Intra-Cross" consistencies (MICROS). This approach learns low-dimensional representations, to be fed into various downstream tasks, through a multi-view network embedding framework. MICROS is based on two well-known principles: diversity and collaboration. The former enables views to maintain their topological characteristics, while the latter enables views to work together and reinforce each other. Unlike existing methods, we also examine a novel form of higher-order collaboration that has not previously been explored on multi-view networks, and integrate it into a unifying framework of consistencies to provide more robust node representations. Finally, we conduct extensive experiments on three real-world multi-view networks. Our results demonstrate that the learned representations consistently outperform state-of-the-art approaches on various downstream tasks, namely node-level tasks (i.e., classification and clustering), relationship mining, and link prediction.
    Doctor of Philosophy
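
The PPI-Keywords (PPIK) network described for Metagraph can be illustrated with a small two-node-type graph. This is a minimal sketch under stated assumptions, not the thesis implementation: the protein identifiers, keywords, and function names are hypothetical, and only the idea that non-neighboring proteins can be related through shared keyword nodes is shown.

```python
from collections import defaultdict


def build_ppik(ppi_edges, protein_keywords):
    """Build an undirected graph with typed nodes: ("protein", id) and ("keyword", name)."""
    adj = defaultdict(set)
    for a, b in ppi_edges:
        adj[("protein", a)].add(("protein", b))
        adj[("protein", b)].add(("protein", a))
    for protein, keywords in protein_keywords.items():
        for kw in keywords:
            adj[("protein", protein)].add(("keyword", kw))
            adj[("keyword", kw)].add(("protein", protein))
    return adj


def shared_keywords(adj, p1, p2):
    """Return the sorted keyword names attached to both proteins."""
    kw1 = {n for n in adj.get(("protein", p1), set()) if n[0] == "keyword"}
    kw2 = {n for n in adj.get(("protein", p2), set()) if n[0] == "keyword"}
    return sorted(name for _, name in kw1 & kw2)


# Hypothetical toy data: P1 and P3 do not interact but share a keyword.
ppi_edges = [("P1", "P2")]
protein_keywords = {"P1": {"kinase"}, "P2": {"membrane"}, "P3": {"kinase", "membrane"}}

ppik = build_ppik(ppi_edges, protein_keywords)
common = shared_keywords(ppik, "P1", "P3")
are_neighbors = ("protein", "P3") in ppik[("protein", "P1")]
```

Here `P1` and `P3` have no PPI edge, yet both attach to the `kinase` keyword node, which is the kind of indirect relation that a metagraph over proteins and keywords is meant to capture.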

    Multi-view collaborative network embedding

    Real-world networks often exist with multiple views, where each view describes one type of interaction among a common set of nodes. For example, on a video-sharing network, two user nodes are linked in one view if they have common favorite videos, and they can also be linked in another view if they share common subscribers. Unlike traditional single-view networks, multiple views maintain different semantics that complement each other. In this paper, we propose MANE, a multi-view network embedding approach to learn low-dimensional representations. Similar to existing studies, MANE hinges on diversity and collaboration: diversity enables views to maintain their individual semantics, while collaboration enables views to work together. However, we also discover a novel form of second-order collaboration that has not been explored previously, and unify it into our framework to attain superior node representations. Furthermore, as each view often has varying importance w.r.t. different nodes, we propose MANE+, an attention-based extension of MANE that models node-wise view importance. Finally, we conduct comprehensive experiments on three public, real-world multi-view networks, and the results demonstrate that our models consistently outperform state-of-the-art approaches.
    Comment: Accepted for publication in the ACM Transactions on Knowledge Discovery from Data, TKDD

    Predicting the Textural Properties of Plant-Based Meat Analogs with Machine Learning

    Plant-based meat analogs are food products that mimic the appearance, texture, and taste of real meat. The development process requires laborious experimental iterations and expert knowledge to meet consumer expectations. To address these problems, we propose a machine learning (ML)-based framework to predict the textural properties of meat analogs. As input features we use the proximate compositions of the raw materials (protein, fat, carbohydrate, fibre, ash, and moisture, in percentages) and the “targeted moisture contents” of the meat analogs. These features are fed to ML models such as Ridge, XGBoost, and MLP, which adopt a built-in feature selection mechanism, to predict “Hardness” and “Chewiness”. We achieved a mean absolute percentage error (MAPE) of 22.9% and a root mean square error (RMSE) of 10.101 for Hardness, and a MAPE of 14.5% and an RMSE of 6.035 for Chewiness. In addition, carbohydrates, fat, and targeted moisture content are found to be the most important factors in determining textural properties. We also investigate multicollinearity among the features, the linearity of the designed model, and inconsistent food compositions to validate the experimental design. Our results show that ML is an effective aid in formulating plant-based meat analogs, laying the groundwork to optimize product development cycles and reduce costs.
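
The two error metrics reported in this abstract follow their standard definitions, which can be sketched as below. The function names and the sample values are hypothetical and are not taken from the study's data.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires non-zero true values."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical measured and predicted values for a texture property.
measured = [100.0, 200.0]
predicted = [110.0, 180.0]
hardness_mape = mape(measured, predicted)
hardness_rmse = rmse(measured, predicted)
```

MAPE is scale-free, whereas RMSE is expressed in the units of the target, so the RMSE values for Hardness and Chewiness are not directly comparable to each other.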